Abstract - Distributed data storage systems are essential to deal
with the need to store massive volumes of data. In order to 
make such a system fault-tolerant, some form of redundancy 
becomes crucial. There are various overheads that are incurred 
due to such redundancy - most prominent ones being overheads in 
terms of storage space and maintenance bandwidth requirements. 
Erasure codes provide a storage efficient alternative to replication 
based redundancy in storage systems. They however entail high 
communication overhead for maintenance in a networked setting, 
when some of the encoded fragments are lost due to failure of 
storage devices and need to be replenished in new ones. Such 
overheads arise from the fundamental need in storage systems 
to recreate (or keep separately) first a copy of the whole object 
before any individual encoded fragment can be generated and 
replenished. Traditional erasure codes, originally designed for 
communication over lossy channels, are optimized for recreation 
of the original message (object), but not for regeneration of 
individual lost encoded parts. We propose as an alternative a new 
family of erasure codes called self-repairing codes (SRC) taking 
into account the peculiarities of distributed storage systems, 
specifically to improve the maintenance process. SRC has the 
following salient features: (a) encoded fragments can be repaired 
directly from other subsets of encoded fragments by downloading 
less data than the size of the complete object, ensuring that 
(b) a fragment is repaired from a fixed number of encoded 
fragments, the number depending only on how many encoded 
blocks are missing and independent of which specific blocks are 
missing. This paper lays the foundations by defining the novel 
self-repairing codes, elaborating why the defined characteristics 
are desirable for distributed storage systems. Then a concrete
family of such codes, namely homomorphic self-repairing codes
(HSRC), is proposed, and its properties are studied in detail
and compared, quantitatively or qualitatively (as suitable),
with those of other codes, including traditional erasure codes
as well as other recent codes designed specifically for storage
applications.

Index Terms — coding, networked storage, self-repair 



I. Introduction 

Various genres of networked storage systems, such as decen- 
tralized peer-to-peer storage systems, as well as dedicated in- 
frastructure based data-centers and storage area networks, have 
gained prominence in recent years. Because of storage node 
failures, or user attrition in a peer-to-peer system, redundancy 
is essential in networked storage systems. This redundancy 
can be achieved using either replication, or (erasure) coding 
techniques, or a mix of the two. Erasure codes require an 
object to be split into k parts, and mapped into n encoded
fragments, such that any k encoded fragments are adequate to 
reconstruct the original object. Such coding techniques play 
a prominent role in providing storage efficient redundancy, 
and are particularly effective for storing large data objects 
and for archival and data back-up applications (for example, 
CleverSafe [?], Wuala [10]).

Redundancy is lost over time because of various reasons 
such as node failures or attrition, and mechanisms to maintain 
redundancy are essential. It was observed in lfl6l that while 
erasure codes are efficient in terms of storage overhead, main- 
tenance of lost redundancy entail relatively huge overheads. 
A naive approach to replace a single missing fragment will 
require that k encoded fragments are first fetched in order to 
create the original object, from which the missing fragment 
is recreated and replenished. This essentially means that for 
every lost fragment, /c-fold more network traffic is incurred. 

Several engineering solutions can partly mitigate the high
maintenance overheads. One approach is to use a 'hybrid'
strategy, where a full replica of the object is additionally
maintained [16]. This ensures that the amount of network
traffic equals the amount of lost data.¹ A spate of recent
works [?], [6] argue that the hybrid strategy adds storage
inefficiency and system complexity, besides creating a single
point of bottleneck. Another possibility is to apply lazy
maintenance [?], [?], whereby maintenance is delayed in order
to amortize the maintenance of several missing fragments.
Lazy strategies additionally avoid maintenance due to temporary
failures. Procrastinating repairs may however lead to a
situation where the system becomes vulnerable, for example
to bursty or correlated failures, and thus may require a much
larger amount of redundancy to start with. Furthermore, the
maintenance operations may lead to spikes in network resource
usage [9].

It is worth highlighting at this juncture that erasure codes 
had originally been designed in order to make communication 
robust, such that loss of some packets over a communication 
channel may be tolerated. Network storage has thus benefit- 
ted from the research done in coding over communication 
channels by using erasure codes as black boxes that provide 
efficient distribution and reconstruction of the stored objects. 
Networked storage however involves different challenges but 
also opportunities not addressed by classical erasure codes. 
Recently, there has thus been a renewed interest 0, 0, 
0> 0, El, 0, ifTTl in designing codes that are optimized 

¹ In this paper, we use the terms 'fragment' and 'block' interchangeably.
Depending on the context, the term 'data' is used to mean either fragment(s)
or object(s).



to deal with the vagaries of networked storage, particularly 
focusing on the maintenance issue. In a volatile network where 
nodes may fail, or come online and go offline frequently, new 
nodes must be provided with fragments of the stored data to 
compensate for the departure of nodes from the system, and 
replenish the level of redundancy (in order to tolerate further 
faults in future). In this paper, we propose a new family of 
codes called self-repairing codes (SRC), which are tailored to 
fit well typical networked storage environments. 

Like any linear $(n,k,d)$ erasure code over a $q$-ary alphabet,
an SRC is formally a linear map $c : \mathbb{F}_q^k \to \mathbb{F}_q^n$, $s \mapsto c(s)$,
which maps a $k$-dimensional vector $s$ to an $n$-dimensional
vector $c(s)$. The set $C$ of codewords $c(s)$, $s \in \mathbb{F}_q^k$, forms
the code (or codebook). The third parameter $d$ refers to the
minimum distance of the code: $d = \min_{x \neq y \in C} d(x,y)$, where
the Hamming distance $d(x,y)$ counts the number of positions
at which the coefficients of $x$ and $y$ differ. The minimum
distance describes how many erasures can be tolerated, which
is known to be at most $n-k$, achieved by maximum distance
separable (MDS) codes. MDS codes thus allow any codeword to be
recovered from any $k$ of its coefficients. Though SRCs are not MDS
codes, their definition mimics the MDS property in terms of
repair: we define self-repairing codes as
$(n,k)$ codes designed to suit networked storage systems, that
encode $k$ fragments of an object into $n$ encoded fragments to
be stored at $n$ nodes, with the properties that:

(a) encoded fragments can be repaired directly from other 
subsets of encoded fragments by downloading less data than 
the size of the complete object. 

More precisely, by analogy with the error correction
capability of erasure codes, which tolerate any $n-k$ losses
independently of which fragments are lost,

(b) a fragment can be repaired from a fixed number of encoded 
fragments, the number depending only on how many encoded 
blocks are missing and independently of which specific blocks 
are missing. 

Two families of SRCs are known to date [12], [13].

To do so, SRCs naturally require more storage overhead
than erasure codes for equivalent fault tolerance (static
resilience). We will see more precisely later on that there is a
tradeoff between the ability to self-repair and this extra storage
overhead: an SRC can be tuned to be MDS at the price of losing
the self-repair property and, conversely, the ability to
self-repair can be adapted based on the amount of extra redundancy
introduced. Consequently, SRCs can recreate the whole object
from $k$ fragments. Unlike for erasure codes, these cannot be
arbitrary $k$ fragments, but many suitable combinations of $k$
fragments can be found (see Section IV for more details).

Note that even for traditional erasure codes, property
(a) may coincidentally be satisfied, but in the absence of a
systematic mechanism this serendipity cannot be leveraged. In
that respect, hierarchical codes (HCs) [6] may be viewed as a
way to do so, and are thus the closest construction
we have found in the literature. They do not, however, give any
guarantee on the number of blocks needed to repair given the
number of losses, i.e., property (b) is not satisfied, and they give no
deterministic guarantee of achieving property (a) either. We
may say that in spirit, SRC is closest to hierarchical codes: at
a very high level, SRC design features mitigate the drawbacks
of HCs.

While motivated by the same problem as regenerating
codes (RGC) and HCs, namely efficient maintenance of lost
redundancy in coding based distributed storage systems,
self-repairing codes (SRC) address it at a
somewhat different point of the design space. We try to minimize
the number of nodes needed for the reconstruction
of a missing block, which automatically translates into
lower bandwidth consumption, but also lower computational
complexity of maintenance, as well as the possibility of
faster and parallel replenishment of lost redundancy. Thus
SRCs allow lightweight (in terms of communication and
computation overhead) and flexible (in terms of
the number of options to carry out specific repairs, which in
turn allows parallel and fast repairs), that is, agile, maintenance
of networked storage systems.

In this work, we make the following contributions: 
(i) We propose a new family of codes, self-repairing codes
(SRC), designed specifically as an alternative to erasure codes
(EC) for providing redundancy in networked storage systems,
which allow repair of individual encoded blocks using only a
few other encoded blocks. Like ECs, SRCs also allow recovery
of the whole object using k encoded fragments, but unlike
in ECs, these cannot be arbitrary k fragments. However,
numerous suitable combinations exist.
(ii) We provide a deterministic code construction called Homomorphic
Self-Repairing Code (HSRC), showcasing that SRCs
can indeed be realized.

(iii) HSRC self-repair operations are computationally efficient.
They are done by XORing encoded blocks, each of them containing
information about all fragments of the object, though the
encoding itself is done through polynomial evaluation (similar
to popular ECs such as Reed-Solomon [15] codes), not by
XORing.

(iv) We show that for equivalent static resilience, marginally
more storage is needed than for traditional erasure codes to
achieve the self-repairing property.

(v) The need of few blocks to reconstruct a lost block naturally 
translates to low overall bandwidth consumption for repair 
operations. SRCs allow for both eager as well as lazy repair 
strategies for equivalent overall bandwidth consumption for a 
wide range of practical system parameter choices. They also 
outperform lazy repair with the use of traditional erasure codes 
for many practical parameter choices. 

(vi) We show that by allowing parallel and independent repair 
of different encoded blocks, SRCs facilitate fast replenishment 
of lost redundancy, allowing a much quicker system recovery 
from a vulnerable state than is possible with traditional codes. 
This also implies a distribution of the repair related tasks 
across different nodes, thus avoiding bottlenecks or overload- 
ing any specific node. 

II. Related work 

In [?], [?], Dimakis et al. propose regenerating codes (RGC)
by exposing the need to reconstruct an erased
encoded block from a smaller amount of data than would be
needed to first reconstruct the whole object. They however
do not address the problem of building new codes that would
solve the issue, but instead use classical erasure codes as a
black box over a network which implements random linear
network coding, and propose leveraging the properties of
network coding to improve the maintenance of the stored data.
Network information flow based analysis shows the possibility
of replacing a missing fragment using network traffic equalling
the volume of lost data. Unfortunately, it is possible to achieve
this optimal limit only by communicating with all the $n-1$
remaining blocks. Consequently, to the best of our knowledge,
the regenerating codes literature generally does not discuss how
it compares with engineering solutions like lazy repair, which
amortizes the repair cost by initiating repairs only when several
fragments are lost. Furthermore, for RGCs to work, even
sub-optimally, it is essential to communicate with at least k other
nodes to reconstruct any missing fragment. Thus, while the
volume of data transfer for maintenance is lowered, RGCs are
expected to have higher protocol overheads, implementation
and computational complexity. For instance, it is noted in [7]
that a randomized linear coding based realization of RGCs
takes an order of magnitude more computation time than
standard erasure codes for both encoding and decoding. The
work of [14] improves on the original RGC papers in that
instead of arguing the existence of regenerating codes via
deterministic network coding algorithms, they provide explicit
network code constructions. Recently, collaborative RGCs were
introduced independently in [8], [?], where it was shown that
collaboration among new nodes joining the network and
participating in the repair process can improve on traditional
RGC, in terms of both (i) the storage-bandwidth trade-off, that is
the amount of data that is stored at each node with respect to
that which is downloaded by new nodes during repair, and (ii)
the number of simultaneous failures tolerated. While such analysis
determines constraints on achievability, for classical as well as
collaborative RGC, code constructions for collaborative RGC
are even sparser: to date, only one construction has been
given in [8], which furthermore corresponds to the same repair
cost as erasure codes.

In [6], the authors make the simple observation that encoding
two bits into three by XORing the two information bits has
the property that any two encoded bits can be used to recover 
the third one. They then propose an iterative construction 
where, starting from small erasure codes, a bigger code, 
called hierarchical code (HC), is built by XORing subblocks 
made by erasure codes or combinations of them. Thus a 
subset of encoded blocks is typically enough to regenerate 
a missing one. However, the size of this subset can vary, from 
the minimal to the maximal number of encoded subblocks, 
determined by not only the number of lost blocks, but also 
the specific lost blocks. So given some lost encoded blocks, 
this strategy may need an arbitrary number of other encoded 
blocks to repair. Pyramid codes [11] explore similar ideas. 

III. Homomorphic Self-Repairing Codes 

In what follows, we denote finite fields by $\mathbb{F}$, and finite
fields without the zero element by $\mathbb{F}^*$. The cardinality of $\mathbb{F}$
is given by its index, that is, $\mathbb{F}_2$ is the binary field with two
elements, which is nothing else than the two bits 0 and 1, with
addition and multiplication modulo 2, $\mathbb{F}_q$ is the finite field
with $q$ elements, and $\mathbb{F}_Q$ is the finite field with $Q$ elements. If
$q = 2^t$ and $Q = q^m = 2^{tm}$ for some positive integers $m$ and $t$,
an element $x \in \mathbb{F}_Q$ can be represented by an $m$-dimensional
vector $x = (x_1,\ldots,x_m)$ where $x_i \in \mathbb{F}_q$, $i = 1,\ldots,m$, by
fixing an $\mathbb{F}_q$-basis of $\mathbb{F}_Q$. Similarly, each coefficient $x_i$ can
be written as $x_i = (x_{i1},\ldots,x_{it})$, $x_{ij} \in \mathbb{F}_2$, so that $x$ may
alternatively be seen as a $tm$-dimensional binary vector
$x = (x_{11},\ldots,x_{1t},\ldots,x_{m1},\ldots,x_{mt})$. We say that $x$ is a vector of
size $m$ to refer to $m$ coefficients in $\mathbb{F}_q$, so that $q$ determines the
unit in which the size $m$ is measured: for example, if $q = 2$,
$x$ is $m$ bits long, and if $q = 8$, $x$ is $m$ symbols of three bits each
(the examples below loosely refer to such symbols as bytes). To do explicit
computations in the finite field $\mathbb{F}_q$, it is often convenient to
use a generator of the multiplicative group $\mathbb{F}_q^*$, which we will
denote by $w$. A generator has the property that $w^{q-1} = 1$, and
there is no smaller positive power of $w$ for which this is true.
Examples of finite fields that will be used later on are given
in Table I.

TABLE I
The finite fields $\mathbb{F}_4$, $\mathbb{F}_8$ and $\mathbb{F}_{16}$, where $w$ denotes the generator.
A. Encoding 

Let $o$ be an object of size $M$ to be stored over a network
of $n$ nodes, that is $o \in \mathbb{F}_{q^M}$, and let $k$ be a positive integer
such that $k$ divides $M$. We can write

$$o = (o_1,\ldots,o_k), \quad o_i \in \mathbb{F}_{q^{M/k}},$$

which requires the use of an $(n,k)$ code over $\mathbb{F}_{q^{M/k}}$ that maps
$o$ to an $Mn/k$-dimensional vector $x$ or, equivalently, an
$n$-dimensional vector

$$x = (x_1,\ldots,x_n), \quad x_i \in \mathbb{F}_{q^{M/k}},$$

after which each $x_i$ is given to a node to be stored. The theory
developed below assumes this model, which can be adjusted
to fit real scenarios as elaborated later in Example 5.

Since the work of Reed and Solomon [15], it is known that
linear coding can be done via polynomial evaluation. In short,
take an object $o = (o_1,o_2,\ldots,o_k)$ of size $M$, with each $o_i$
in $\mathbb{F}_{q^{M/k}}$, and create the polynomial

$$p(X) = o_1 + o_2 X + \ldots + o_k X^{k-1} \in \mathbb{F}_{q^{M/k}}[X].$$

Now evaluate $p(X)$ in $n$ elements $\alpha_1,\ldots,\alpha_n \in \mathbb{F}^*_{q^{M/k}}$, to get
the codeword

$$(p(\alpha_1),\ldots,p(\alpha_n)), \quad k < n \leq q^{M/k} - 1.$$



Example 1: Suppose that $q = 2$, so that the size of
the object is measured in bits. Take the 4 bit long object
$o = (o_1,o_2,o_3,o_4)$, and create $k = 2$ fragments: $\mathbf{o}_1 = (o_1,o_2) \in \mathbb{F}_4$,
$\mathbf{o}_2 = (o_3,o_4) \in \mathbb{F}_4$. We use a (3,2) Reed-Solomon
code over $\mathbb{F}_4$ to store the file in 3 nodes. Recall that
$\mathbb{F}_4 = \{(a_0,a_1),\ a_0,a_1 \in \mathbb{F}_2\} = \{a_0 + a_1 w,\ a_0,a_1 \in \mathbb{F}_2\}$
where $w^2 = w + 1$. Thus we can alternatively represent each
fragment as: $\mathbf{o}_1 = o_1 + o_2 w \in \mathbb{F}_4$, $\mathbf{o}_2 = o_3 + o_4 w \in \mathbb{F}_4$. The
encoding is done by first mapping the two fragments into a
polynomial $p(X) \in \mathbb{F}_4[X]$:

$$p(X) = (o_1 + o_2 w) + (o_3 + o_4 w) X,$$

and then evaluating $p(X)$ in the three non-zero elements of
$\mathbb{F}_4$, to get a codeword of length 3:

$$(p(1), p(w), p(w+1)),$$

where $p(1) = o_1 + o_3 + w(o_2 + o_4)$, $p(w) = o_1 + o_4 + w(o_2 + o_3 + o_4)$,
$p(w^2) = o_1 + o_3 + o_4 + w(o_2 + o_3)$, so that
each node gets two bits to store: $(o_1 + o_3,\ o_2 + o_4)$ at node 1,
$(o_1 + o_4,\ o_2 + o_3 + o_4)$ at node 2, $(o_1 + o_3 + o_4,\ o_2 + o_3)$ at
node 3.
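The closed-form expressions of Example 1 can be checked mechanically. The following is a minimal sketch, assuming elements of $\mathbb{F}_4$ are stored as 2-bit integers (bit 0 for the constant term, bit 1 for the coefficient of $w$); the function names `gf4_mul`, `encode` and `closed_form` are illustrative and not part of the paper.

```python
from itertools import product

def gf4_mul(a, b):
    """Multiply in F_4 = {a0 + a1*w}, w^2 = w + 1 (elements are 2-bit ints)."""
    r = 0
    for i in range(2):
        if (b >> i) & 1:
            r ^= a << i          # carry-less multiplication
    if r & 0b100:                # reduce w^2 -> w + 1
        r ^= 0b111
    return r

def encode(bits):
    """(3,2) Reed-Solomon encoding of Example 1: evaluate p at 1, w, w + 1."""
    o1, o2, o3, o4 = bits
    f1 = o1 | (o2 << 1)          # fragment o1 + o2*w
    f2 = o3 | (o4 << 1)          # fragment o3 + o4*w
    points = [0b01, 0b10, 0b11]  # 1, w, w + 1 = w^2
    return [f1 ^ gf4_mul(f2, x) for x in points]

def closed_form(bits):
    """The bit pairs stored at nodes 1, 2, 3 as written in Example 1."""
    o1, o2, o3, o4 = bits
    pairs = [(o1 ^ o3, o2 ^ o4),
             (o1 ^ o4, o2 ^ o3 ^ o4),
             (o1 ^ o3 ^ o4, o2 ^ o3)]
    return [lo | (hi << 1) for lo, hi in pairs]

# Compare polynomial evaluation with the closed form on all 16 objects.
matches = all(encode(b) == closed_form(b) for b in product((0, 1), repeat=4))
print(matches)
```

The check runs over all 16 possible 4-bit objects, so agreement here confirms the three expressions for $p(1)$, $p(w)$ and $p(w^2)$ under the stated representation.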

Definition 1: We call homomorphic self-repairing code, denoted
by HSRC$(n,k)$, the code obtained by evaluating the
polynomial

$$p(X) = \sum_{i=0}^{k-1} p_i X^{q^i} \in \mathbb{F}_{q^{M/k}}[X] \qquad (1)$$

in $n$ non-zero values $\alpha_1,\ldots,\alpha_n$ of $\mathbb{F}_{q^{M/k}}$ to get an
$n$-dimensional codeword

$$(p(\alpha_1),\ldots,p(\alpha_n)),$$

where $p_i = o_{i+1}$, $i = 0,\ldots,k-1$, and each $p(\alpha_i)$ is given to
node $i$ for storage.

In particular, we need the code parameters $(n,k)$ to satisfy

$$k < n \leq q^{M/k} - 1. \qquad (2)$$

The analysis that follows refers to this family of self-repairing
codes.

B. Self-repair 

Since we work over finite fields that contain $\mathbb{F}_2$, recall
that all operations are done in characteristic 2, that is, scalar
operations are performed modulo 2. Let $a, b \in \mathbb{F}_{q^m}$, for
some $m \geq 1$ and $q = 2^t$. Then we have that
$(a+b)^2 = a^2 + 2ab + b^2 = a^2 + b^2$ since $2ab \equiv 0 \mod 2$, and
consequently

$$(a+b)^{2^i} = \sum_{j=0}^{2^i} \binom{2^i}{j} a^j b^{2^i - j} = a^{2^i} + b^{2^i}, \quad i \geq 1. \qquad (3)$$



Definition 2: A linearized polynomial $p(X)$ over $\mathbb{F}_Q$, $Q = q^m$,
has the form

$$p(X) = \sum_{i=0}^{k-1} p_i X^{Q^i}, \quad p_i \in \mathbb{F}_Q.$$

More generally, one can consider a polynomial $p(X)$ over $\mathbb{F}_Q$,
$Q = q^m$, of the form:

$$p(X) = \sum_{i=0}^{k-1} p_i X^{s^i}, \quad p_i \in \mathbb{F}_Q,$$

where $s = q^l$, $1 \leq l \leq m$ ($l = m$ makes $p(X)$ a linearized
polynomial). These polynomials share the following useful
property:

Lemma 1: Let $a, b \in \mathbb{F}_{q^m}$ and let $p(X)$ be the polynomial
given by $p(X) = \sum_{i=0}^{k-1} p_i X^{s^i}$, $s = q^l$, $1 \leq l \leq m$. We have

$$p(ua + vb) = u p(a) + v p(b), \quad u, v \in \mathbb{F}_s.$$

Proof: If we evaluate $p(X)$ in $ua + vb$, we get

$$p(ua + vb) = \sum_{i=0}^{k-1} p_i (ua + vb)^{s^i} = \sum_{i=0}^{k-1} p_i \left( (ua)^{s^i} + (vb)^{s^i} \right)$$

by (3), and

$$p(ua + vb) = \sum_{i=0}^{k-1} p_i \left( u a^{s^i} + v b^{s^i} \right) = u \sum_{i=0}^{k-1} p_i a^{s^i} + v \sum_{i=0}^{k-1} p_i b^{s^i}$$

using the property that $u^s = u$ for $u \in \mathbb{F}_s$. ∎

We now define a weakly linearized polynomial as follows.

Definition 3: A weakly linearized polynomial $p(X)$ over
$\mathbb{F}_Q$, $Q = q^m$, has the form

$$p(X) = \sum_{i=0}^{k-1} p_i X^{q^i}, \quad p_i \in \mathbb{F}_Q.$$

We chose the name weakly linearized polynomial, since we
only retain the $\mathbb{F}_q$-linearity, namely:

Corollary 1: Let $a, b \in \mathbb{F}_{q^m}$ and let $p(X)$ be a weakly
linearized polynomial given by $p(X) = \sum_{i=0}^{k-1} p_i X^{q^i}$. We
have

$$p(ua + vb) = u p(a) + v p(b), \quad u, v \in \mathbb{F}_q. \qquad (4)$$

In particular

$$p(a + b) = p(a) + p(b). \qquad (5)$$

It is the choice of a weakly linearized polynomial in (1) that
enables self-repair.

Example 2: Consider the polynomial

$$p(X) = p_0 X + p_1 X^2 + p_2 X^4 \in \mathbb{F}_8[X].$$

We have (see Table I for $\mathbb{F}_8$ arithmetic)

$$p(w + 1) = p_0 (w + 1) + p_1 (w^2 + 1) + p_2 (w^4 + 1) = p(w) + p(1),$$

however

$$p(w^2) \neq p(w)^2$$

since $p_i^2 = p_i$ if and only if $p_i \in \mathbb{F}_2$.
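The two claims of Example 2 can be verified exhaustively over $\mathbb{F}_8$. The sketch below assumes that $\mathbb{F}_8$ is built with $w^3 = w + 1$ (the additivity check does not depend on this choice of irreducible polynomial); `gf_mul`, `gf_pow` and `p` are illustrative helper names.

```python
from itertools import product

MOD8, M8 = 0b1011, 3  # assumption: F_8 = F_2[w] / (w^3 + w + 1)

def gf_mul(a, b):
    """Multiply two elements of F_8 stored as 3-bit integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M8) & 1:   # reduce as soon as degree 3 appears
            a ^= MOD8
    return r

def gf_pow(a, e):
    """Raise a to the non-negative integer power e by repeated multiplication."""
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def p(x, c):
    """Evaluate p(X) = c[0]*X + c[1]*X^2 + c[2]*X^4 over F_8."""
    return (gf_mul(c[0], x)
            ^ gf_mul(c[1], gf_pow(x, 2))
            ^ gf_mul(c[2], gf_pow(x, 4)))

# Additivity p(a + b) = p(a) + p(b) for every coefficient triple and every a, b.
additive = all(
    p(a ^ b, c) == p(a, c) ^ p(b, c)
    for c in product(range(8), repeat=3)
    for a in range(8)
    for b in range(8)
)

# p is not multiplicative in general: take p(X) = w*X, whose coefficient is not in F_2.
W = 0b010
c = (W, 0, 0)
lhs = p(gf_pow(W, 2), c)   # p(w^2)
rhs = gf_pow(p(W, c), 2)   # p(w)^2
print(additive, lhs != rhs)
```

Addition in $\mathbb{F}_8$ is a bitwise XOR in this representation, which is why the homomorphic property translates into XOR-based repair.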

A codeword from HSRC$(n,k)$ is then of the form
$(p(\alpha_1),\ldots,p(\alpha_n))$, where $p(X)$ is a weakly linearized
polynomial. Since $\mathbb{F}_{q^{M/k}}$ contains an $\mathbb{F}_q$-basis $B = \{b_1,\ldots,b_{M/k}\}$,
the $\alpha_i$, $i = 1,\ldots,n$, can be expressed as $\mathbb{F}_q$-linear
combinations of the basis elements, and we have from Lemma 1 that

$$\alpha_i = \sum_{j=1}^{M/k} \alpha_{ij} b_j, \ \alpha_{ij} \in \mathbb{F}_q \ \Rightarrow \ p(\alpha_i) = \sum_{j=1}^{M/k} \alpha_{ij} p(b_j).$$

In words, this means that if $p$ is evaluated in the elements of
the basis $B$ (or any other basis), then any encoded fragment
$p(\alpha_i)$ can be obtained as a linear combination of other encoded
fragments.

Controlling the amount of self-repaired redundancy. The
amount of redundancy allowing self-repair introduced in the
coding scheme can be controlled through two mechanisms:

1) Firstly, given $k$ fragments, there are different values of $n$,
and different choices of $\{\alpha_1,\ldots,\alpha_n\}$, that can be chosen
to define a self-repairing code. Let us denote by $n_{max}$
the maximum value that $n$ can take, namely
$n_{max} = q^{M/k} - 1$. By choosing the set of $\alpha_i$ to form a subspace of
$\mathbb{F}_{q^{M/k}}$, we can reduce the redundancy while maintaining
a particularly nice symmetric structure of the code. In
the extreme case where $\alpha_1,\ldots,\alpha_n$ are contained in $B$,
the code has no self-repairing property, and is in fact
an MDS code. Thus SRC can be tuned to provide the
desired amount of redundancy, from MDS and no self-repair,
to the maximal amount of self-repair with $n_{max}$.

2) As seen in Lemma 1, the power $s$ of $X^s$ in the
weakly linearized polynomial $p(X)$ determines the
$\mathbb{F}_s$-linearity of $p(X)$. Consequently, the bigger $s$, the more
redundancy, since from (4)

$$p(ua + vb) = u p(a) + v p(b), \quad u, v \in \mathbb{F}_s,$$

meaning that the encoded fragment $p(ua + vb)$ can be
repaired by contacting two nodes $p(a)$, $p(b)$ in as many
ways as there are ways of writing $ua + vb$, namely
$(q-1)^2$.

In the particular case where $s = 2$, we obtain from (5)
an XOR-like structure

$$p(a + b) = p(a) + p(b).$$

However, it is worth remarking that though the encoded
fragments can thus be obtained as XORs of each other,
each fragment actually contains information about
all the different fragments, which is very different from a
simple XOR of the data itself. In particular, HSRC is not
a systematic code. The implications of the lack of the systematic
property will be discussed in Subsection VI-B.

Computational complexity of self-repair. In terms of
computational complexity, the case $s = 2$ implies that the cost
of a block reconstruction is that of some XORs (one in the
most favorable case, when two terms are enough to reconstruct
a block, up to $k - 1$ in the worst case), independently of $q$,
since if $q = 2^t$, the addition in $\mathbb{F}_q$ is done by addition modulo 2
componentwise. The cost increases if one would like to exploit
the $\mathbb{F}_q$-linearity. Indeed, repairing through

$$p(ua + vb) = u p(a) + v p(b), \quad u, v \in \mathbb{F}_q,$$

further requires two multiplications in $\mathbb{F}_q$.

TABLE II
Examples of small code parameters for $q = 2$ (on the left) and
$q = 8$ (on the right).

C. Decoding

That decoding is possible is guaranteed by either Lagrange
interpolation, or by considering a system of linear equations,
assuming that

$$k \leq M/k, \qquad (6)$$

as detailed below.

Lagrange interpolation. Given $k$ fragments
$p(\alpha_{i_1}),\ldots,p(\alpha_{i_k})$ such that $\alpha_{i_1},\ldots,\alpha_{i_k}$ are linearly
independent, the node that wants to reconstruct the file
computes $q^k - 1$ linear combinations of the $k$ fragments,
which gives, thanks to the homomorphic property (4),
$q^k - 1$ points in which $p$ is evaluated. Lagrange interpolation
guarantees that it is enough to have $q^{k-1} + 1$ points (which
we have, since $q^k - 1 \geq q^{k-1} + 1$ for $k \geq 2$) to reconstruct
uniquely the polynomial $p$ and thus the object. This requires
$q^{k-1} + 1 \leq q^{M/k} - 1$,
namely there must be enough points in which to evaluate the
polynomial, which holds subject to (6):

$$q^k \leq q^{M/k} \ \Rightarrow \ q^{k-1} + 1 \leq q^k - 1 \leq q^{M/k} - 1.$$

Solving a system of linear equations. Alternatively,
one can consider decoding as solving a system of linear
equations. Given $k$ linearly independent fragments, say
$p(\alpha_{i_1}),\ldots,p(\alpha_{i_k})$, we can write

$$\begin{pmatrix} \alpha_{i_1} & \alpha_{i_1}^{q} & \cdots & \alpha_{i_1}^{q^{k-1}} \\ \vdots & & & \vdots \\ \alpha_{i_k} & \alpha_{i_k}^{q} & \cdots & \alpha_{i_k}^{q^{k-1}} \end{pmatrix} \begin{pmatrix} p_0 \\ p_1 \\ \vdots \\ p_{k-1} \end{pmatrix} = \begin{pmatrix} p(\alpha_{i_1}) \\ p(\alpha_{i_2}) \\ \vdots \\ p(\alpha_{i_k}) \end{pmatrix}$$

and the problem of recovering the object reduces to solving
the above system of linear equations. Note that since $\mathbb{F}_{q^{M/k}}$
is a vector space of dimension $M/k$ over $\mathbb{F}_q$, condition (6)
is needed to guarantee that there exist $k$ linearly independent
fragments.
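Decoding by solving the linear system can be sketched as follows for $q = 2$, $k = 3$ over $\mathbb{F}_{16}$. This is a minimal illustration under stated assumptions: $\mathbb{F}_{16}$ is built with $w^4 = w + 1$, elements are 4-bit integers, the coefficient values are arbitrary, and the helper names (`gf_mul`, `gf_pow`, `gf_inv`, `evaluate`, `decode`) are hypothetical.

```python
MOD16, M16 = 0b10011, 4   # assumption: F_16 = F_2[w] / (w^4 + w + 1)

def gf_mul(a, b):
    """Multiply two elements of F_16 stored as 4-bit integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M16) & 1:
            a ^= MOD16
    return r

def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def gf_inv(a):
    """Inverse of a non-zero element: a^(16-2)."""
    return gf_pow(a, 14)

def evaluate(coeffs, x, q=2):
    """p(x) = sum_i coeffs[i] * x^(q^i), a weakly linearized polynomial."""
    y = 0
    for i, c in enumerate(coeffs):
        y ^= gf_mul(c, gf_pow(x, q ** i))
    return y

def decode(alphas, values, q=2):
    """Solve A p = values with A[r][i] = alphas[r]^(q^i) by Gauss-Jordan elimination."""
    k = len(alphas)
    rows = [[gf_pow(a, q ** i) for i in range(k)] + [v]
            for a, v in zip(alphas, values)]
    for col in range(k):
        # A pivot exists as long as the alphas are linearly independent over F_q.
        piv = next(r for r in range(col, k) if rows[r][col])
        rows[col], rows[piv] = rows[piv], rows[col]
        inv = gf_inv(rows[col][col])
        rows[col] = [gf_mul(inv, v) for v in rows[col]]
        for r in range(k):
            if r != col and rows[r][col]:
                f = rows[r][col]
                rows[r] = [v ^ gf_mul(f, pv) for v, pv in zip(rows[r], rows[col])]
    return [row[k] for row in rows]

coeffs = [0b1010, 0b0111, 0b1100]   # arbitrary illustrative fragments p_0, p_1, p_2
alphas = [0b0001, 0b0010, 0b0100]   # 1, w, w^2: linearly independent over F_2
values = [evaluate(coeffs, a) for a in alphas]
recovered = decode(alphas, values)
print(recovered == coeffs)
```

If the chosen $\alpha_{i_j}$ were linearly dependent, no pivot would be found in some column and `next` would raise `StopIteration`, which mirrors the requirement of $k$ linearly independent fragments.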

D. Worked out examples 

Let us first illustrate the choices of the code parameters 
(n, k), before detailing some code constructions. 



TABLE III
Ways of reconstructing missing fragment(s) in Example 3

missing fragment(s)            pairs to reconstruct missing fragment(s)

$p(1)$                         $(p(w),p(w^4))$; $(p(w^2),p(w^8))$; $(p(w^5),p(w^{10}))$
$p(w)$                         $(p(1),p(w^4))$; $(p(w^2),p(w^5))$; $(p(w^8),p(w^{10}))$
$p(w^2)$                       $(p(1),p(w^8))$; $(p(w),p(w^5))$; $(p(w^4),p(w^{10}))$

$p(1)$ and $p(w)$              $(p(w^2),p(w^8))$ or $(p(w^5),p(w^{10}))$ for $p(1)$
                               $(p(w^8),p(w^{10}))$ or $(p(w^2),p(w^5))$ for $p(w)$

$p(1)$ and $p(w)$ and $p(w^2)$ $(p(w^5),p(w^{10}))$ for $p(1)$
                               $(p(w^8),p(w^{10}))$ for $p(w)$
                               $(p(w^4),p(w^{10}))$ for $p(w^2)$

We recall that the parameters $(n,k)$ of an HSRC$(n,k)$ code
must satisfy conditions (2) and (6):

$$k < n \leq q^{M/k} - 1, \quad k \leq M/k.$$

Thus for any choice of $k$:

1) pick any $M$ which is a multiple of $k$ (zero padding can
be used to remove the constraint on the real size of the
object),

2) define $n_{max} = q^{M/k} - 1$,

3) pick any $n$ such that $n > k$, $n \leq n_{max}$,
which is a power of $q$ minus 1 (this last condition is not
strictly necessary but ensures symmetry, as already
mentioned above).

Some examples of small parameters $(n,k)$ are given in Table
II for $q = 2$ and $q = 8$.
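The three-step parameter selection above can be written as a small helper. This is a sketch only: the function name `hsrc_parameters` is hypothetical, and condition (6) is read as $k \leq M/k$, which is what Example 4 (with $k = M/k = 4$) requires.

```python
def hsrc_parameters(q, k, M):
    """Return (n_max, symmetric choices of n) for HSRC(n, k) on an object of size M.

    The symmetric choices are the values q^j - 1 with k < q^j - 1 <= n_max.
    """
    if M % k:
        raise ValueError("M must be a multiple of k (pad the object with zeros)")
    m = M // k                       # fragment size M/k
    if k > m:
        raise ValueError("condition (6), k <= M/k, is violated")
    n_max = q ** m - 1               # condition (2): k < n <= q^(M/k) - 1
    symmetric = [q ** j - 1 for j in range(1, m + 1) if q ** j - 1 > k]
    return n_max, symmetric

# Parameters of Example 3 (q = 2, k = 3, M = 12) and Example 4 (q = 8, k = 4, M = 16).
example3 = hsrc_parameters(2, 3, 12)
example4 = hsrc_parameters(8, 4, 16)
print(example3, example4)
```

Any other $n$ with $k < n \leq n_{max}$ remains admissible; the helper only lists the values that give the symmetric structure mentioned in step 3.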

Example 3: Take a data file $o = (o_1,\ldots,o_{12})$ of $M = 12$
bits ($q = 2$), and choose $k = 3$ fragments. We have that
$M/k = 4$, which satisfies (6), that is $k = 3 \leq M/k = 4$.

The file $o$ is cut into 3 fragments $\mathbf{o}_1 = (o_1,\ldots,o_4)$,
$\mathbf{o}_2 = (o_5,\ldots,o_8)$, $\mathbf{o}_3 = (o_9,\ldots,o_{12}) \in \mathbb{F}_{2^4}$. Let $w$ be a generator
of the multiplicative group $\mathbb{F}^*_{2^4}$, such that $w^4 = w + 1$. The
polynomial used for the encoding is

$$p(X) = \sum_{i=1}^{4} o_i w^i X + \sum_{i=1}^{4} o_{i+4} w^i X^2 + \sum_{i=1}^{4} o_{i+8} w^i X^4.$$

The $n$-dimensional codeword is obtained by evaluating $p(X)$
in $n$ elements of $\mathbb{F}_{2^4}$, $n \leq 15 = n_{max}$ by (2).

For $n = 4$, if we evaluate $p(X)$ in $w^i$, $i = 0, 1, 2, 3$, then
the 4 encoded fragments $p(1), p(w), p(w^2), p(w^3)$ are linearly
independent and there is no self-repair possible.

Now for $n = 7$, and say, $1, w, w^2, w^4, w^5, w^8, w^{10}$, we get:

$$(p(1), p(w), p(w^2), p(w^4), p(w^5), p(w^8), p(w^{10})).$$

Suppose node 5, which stores $p(w^5)$, goes offline. A newcomer
can get $p(w^5)$ by asking for $p(w^2)$ and $p(w)$, since

$$p(w^5) = p(w^2 + w) = p(w^2) + p(w).$$

Table III shows other examples of missing fragments and
which pairs can reconstruct them, depending on whether 1, 2, or
3 fragments are missing at the same time.

As for decoding, since $p(X)$ is of degree 4, a node that
wants to recover the data needs $k = 3$ linearly independent
fragments, say $p(w), p(w^2), p(w^3)$, out of which it can generate
$p(aw + bw^2 + cw^3)$, $a, b, c \in \{0, 1\}$. Out of the 7 non-zero
coefficients, 5 of them are enough to recover $p$.

Example 4: Take now a data file $o = (o_1,\ldots,o_{16})$ of $M = 16$
bytes (that is $q = 8$), and choose $k = 4$ fragments. We have
that $M/k = 4$, which satisfies (6), that is $k \leq M/k$.

The file $o$ is cut into 4 fragments $\mathbf{o}_1 = (o_1,\ldots,o_4)$,
$\mathbf{o}_2 = (o_5,\ldots,o_8)$, $\mathbf{o}_3 = (o_9,\ldots,o_{12})$, $\mathbf{o}_4 = (o_{13},\ldots,o_{16}) \in \mathbb{F}_{8^4}$.
Let $w$ be a generator of the multiplicative group of $\mathbb{F}_8$, and
$\nu$ be a generator of the multiplicative group of $\mathbb{F}_{8^4}/\mathbb{F}_8$ such
that $\nu^4 = \nu^3 + w$. The polynomial used for the encoding can
be either

$$p(X) = \sum_{i=1}^{4} o_i \nu^i X + \sum_{i=1}^{4} o_{i+4} \nu^i X^{8} + \sum_{i=1}^{4} o_{i+8} \nu^i X^{64} + \sum_{i=1}^{4} o_{i+12} \nu^i X^{512}$$

or

$$p'(X) = \sum_{i=1}^{4} o_i \nu^i X + \sum_{i=1}^{4} o_{i+4} \nu^i X^{2} + \sum_{i=1}^{4} o_{i+8} \nu^i X^{4} + \sum_{i=1}^{4} o_{i+12} \nu^i X^{8}.$$

The $n$-dimensional codeword is obtained by evaluating $p(X)$
in $n$ elements of $\mathbb{F}_{8^4}$, $n \leq 8^4 - 1 = n_{max}$ by (2).

Now for $n = 63$, and say, $\{u + v\nu,\ u, v \in \mathbb{F}_8,\ (u,v) \neq (0,0)\}$,
we get:

$$\{p(u + v\nu),\ u, v \in \mathbb{F}_8,\ (u,v) \neq (0,0)\},$$

respectively

$$\{p'(u + v\nu),\ u, v \in \mathbb{F}_8,\ (u,v) \neq (0,0)\}.$$

Let us give an example of repair in both cases. Let us start
with $p'(X)$. Suppose the node which stores $p'(w + \nu)$ goes
offline. A newcomer can get $p'(w + \nu)$ by asking for $p'(w)$
and $p'(\nu)$, since

$$p'(w + \nu) = p'(w) + p'(\nu).$$

If we are instead using $p(X)$, when the node which stores
$p(w + \nu)$ goes offline, then for any choice of $u, v \neq 0$, $u, v \in \mathbb{F}_8$,
a newcomer can ask for $p(uw)$ and $p(v\nu)$, and then compute

$$u^{-1} p(uw) + v^{-1} p(v\nu) = p(w) + p(\nu).$$

Example 5: The HSRC codes described above implicitly
assume a specific, fixed input size, determined by the choices
of $n$, $k$, $q$ on which the coding is to be performed. This is
also the case for many other coding schemes such as Reed-Solomon
codes. In real life, data objects may however come
in an arbitrary size. Two heuristics deal with the consequent
constraints, namely zero padding (for an object which is too
small) and slicing (for a large object).

For instance, consider as object $o$ a file of 5 MB $= 5 \cdot 2^{10}$ KB
$= 5 \cdot 2^{20}$ B, so that $M = 5 \cdot 2^{20}$ for $q = 8$. We cut $o$ into $2^8$ slices
$s_1,\ldots,s_{256}$, each slice having size $M' = 5 \cdot 2^{12}$ B $= 20$ KB. We
now encode each $s_i$ into a codeword $x_i = (x_{i,1},\ldots,x_{i,511})$,
using a $(511, k')$ code, with $k' = 5 \cdot 2^4 = 80$, so that the size
of the encoded blocks is $M'/k' = 2^8 = 256$ B. Then the $j$th
node stores the $j$th encoded fragment $x_{i,j}$, $j = 1,\ldots,511$, for
all the slices $i = 1,\ldots,256$. To get a point of comparison for
the parameter values $(n,k)$ used in this example, we note that
Wuala uses a (517,100) code, while (255,223) Reed-Solomon
codes (in bytes) are also widely used.
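The arithmetic of the slicing example can be reproduced with a short helper. This is a sketch under the example's own convention that sizes are counted in symbols of $\mathbb{F}_q$; the function name `slicing_plan` and the returned field names are hypothetical.

```python
def slicing_plan(object_size, num_slices, k, n, q):
    """Derive slice and fragment sizes for slicing, and check conditions (2) and (6)."""
    if object_size % num_slices:
        raise ValueError("pad the object so that it splits evenly into slices")
    slice_size = object_size // num_slices        # M'
    if slice_size % k:
        raise ValueError("k' must divide the slice size M'")
    fragment_size = slice_size // k               # M'/k'
    return {
        "slice_size": slice_size,
        "fragment_size": fragment_size,
        # Each node keeps one encoded fragment per slice.
        "stored_per_node": num_slices * fragment_size,
        # Conditions (2) and (6) applied to one slice.
        "valid": k < n <= q ** fragment_size - 1 and k <= fragment_size,
    }

# The 5 MB object of Example 5: 2^8 slices, a (511, 80) code, q = 8.
plan = slicing_plan(5 * 2 ** 20, 2 ** 8, 80, 511, 8)
print(plan)
```

With these inputs each node ends up holding 256 fragments of 256 symbols each, one per slice.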

A final observation we want to make here is that, in the 
following section, when we analyse the static resilience of a 
code, it determines the availability for one slice of the object, 
rather than the object itself. However, placing the encoded 
fragments of all the slices in a common pool of storage 
nodes - which is also practical in terms of managing meta- 
information - leads to the same availability of the object, as 
for the individual slices. Hence we do not distinguish the two 
in the rest of this paper, and consider that the code could be 
applied on any object, independently of its size. Note that this would not hold if the encoded fragments of different slices were placed on different sets of nodes. We do not delve into this case any further, since it is also rarely used in practice due to system design considerations.

IV. Static Resilience Analysis 

The rest of the paper is dedicated to the analysis of the 
proposed homomorphic self-repairing codes. Static resilience 
of a distributed storage system is defined as the probability 
that an object, once stored in the system, will continue to stay 
available without any further maintenance, even when a certain 
fraction of individual member nodes of the distributed system 
become unavailable. We start the evaluation of the proposed scheme with a static resilience analysis, where we study how a stored object can be recovered using HSRCs, compared with traditional erasure codes, prior to considering the maintenance process, which will be done in Section V.

Let $p_{node}$ be the probability that any specific node is available. Then, under the assumptions that node availability is i.i.d., and no two fragments of the same object are placed on the same node, we can consider that the availability of any fragment is also i.i.d. with probability $p_{node}$.

A. A network matrix representation 

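Recall that using the above coding strategy, an object $o$ of length $M$ is decomposed into $k$ fragments of length $M/k$:

$$o = (o_1, \ldots, o_k), \quad o_i \in \mathbb{F}_{q^{M/k}},$$

which are further encoded into $n$ fragments of the same length:

$$x = (x_1, \ldots, x_n), \quad x_i \in \mathbb{F}_{q^{M/k}},$$

and each encoded fragment $x_i = p(\alpha_i)$ is given to a node to be stored. We thus have $n$ nodes each possessing a $q$-ary vector of length $M/k$, corresponding to a system of $n$ linear equations

$$\begin{pmatrix}
\alpha_1 & \alpha_1^{q} & \alpha_1^{q^2} & \cdots & \alpha_1^{q^{k-1}} \\
\alpha_2 & \alpha_2^{q} & \alpha_2^{q^2} & \cdots & \alpha_2^{q^{k-1}} \\
\vdots & & & & \vdots \\
\alpha_n & \alpha_n^{q} & \alpha_n^{q^2} & \cdots & \alpha_n^{q^{k-1}}
\end{pmatrix}
\begin{pmatrix} p_0 \\ p_1 \\ \vdots \\ p_{k-1} \end{pmatrix}
=
\begin{pmatrix} p(\alpha_1) \\ p(\alpha_2) \\ \vdots \\ p(\alpha_n) \end{pmatrix}.$$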



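If three rows are $\mathbb{F}_q$-linearly dependent, say rows 1, 2 and 3, then

$$u(\alpha_1, \alpha_1^{q}, \alpha_1^{q^2}, \ldots, \alpha_1^{q^{k-1}}) + v(\alpha_2, \alpha_2^{q}, \alpha_2^{q^2}, \ldots, \alpha_2^{q^{k-1}}) = (\alpha_3, \alpha_3^{q}, \alpha_3^{q^2}, \ldots, \alpha_3^{q^{k-1}}), \quad u, v \in \mathbb{F}_q,$$

which can be rewritten as

$$(u\alpha_1 + v\alpha_2,\ u\alpha_1^{q} + v\alpha_2^{q},\ u\alpha_1^{q^2} + v\alpha_2^{q^2},\ \ldots,\ u\alpha_1^{q^{k-1}} + v\alpha_2^{q^{k-1}})$$

$$= (u\alpha_1 + v\alpha_2,\ (u\alpha_1 + v\alpha_2)^{q},\ (u\alpha_1 + v\alpha_2)^{q^2},\ \ldots,\ (u\alpha_1 + v\alpha_2)^{q^{k-1}})$$

$$= (\alpha_3, \alpha_3^{q}, \alpha_3^{q^2}, \ldots, \alpha_3^{q^{k-1}}), \quad u, v \in \mathbb{F}_q,$$

which holds if and only if

$$u\alpha_1 + v\alpha_2 = \alpha_3.$$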



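Thus to understand the linear dependencies among the fragments owned by each of the $n$ nodes, one can associate to the $i$th node the value $\alpha_i$. Once all the $\alpha_i$ are written in an $\mathbb{F}_q$-basis, they can be represented as an $n \times M/k$ $q$-ary matrix

$$\mathbf{M} = \begin{pmatrix}
\alpha_{1,1} & \cdots & \alpha_{1,M/k} \\
\vdots & & \vdots \\
\alpha_{n,1} & \cdots & \alpha_{n,M/k}
\end{pmatrix} \qquad (7)$$

with $\alpha_{i,j} \in \mathbb{F}_q$.

Example 6: In Example 3, by fixing as $\mathbb{F}_2$-basis $\{1, w, w^2, w^3\}$, we have for $n = 4$ that $\mathbf{M} = I_4$, the 4-dimensional identity matrix, since $\alpha_i = w^i$, $i = 0, \ldots, 3$, while for $n = 7$, it is

$$\mathbf{M}^T = \begin{pmatrix}
1 & 0 & 0 & 1 & 0 & 1 & 1 \\
0 & 1 & 0 & 1 & 1 & 0 & 1 \\
0 & 0 & 1 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}$$

corresponding to $1, w, w^2, w^4, w^5, w^8, w^{10}$. Now in Example 4, by fixing as $\mathbb{F}_8$-basis $\{1, \nu, \nu^2, \nu^3\}$, we have

$$\mathbf{M}^T = \begin{pmatrix} u \\ v \\ 0 \\ 0 \end{pmatrix}, \quad u, v \in \mathbb{F}_8,\ (u,v) \neq 0,$$

where each admissible pair $(u, v)$ contributes one column.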

Thus unavailability of a random node is equivalent to losing one linear equation, or a random row of the matrix $\mathbf{M}$. If multiple random nodes (say $n - x$) become unavailable, then the remaining $x$ nodes provide $x$ encoded fragments, which can be represented by an $x \times M/k$ sub-matrix $\mathbf{M}_x$ of $\mathbf{M}$. For any given combination of such $x$ available encoded fragments, the original object can still be reconstructed if we can obtain at least $k$ linearly independent rows of $\mathbf{M}_x$. This is equivalent to saying that the object can be reconstructed if the rank of the matrix $\mathbf{M}_x$ is larger than or equal to $k$.
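The rank criterion can be checked by brute force on the $n = 7$ code of Example 6. The sketch below is restricted to $q = 2$ and stores each row of $\mathbf{M}$ as a bitmask, assuming the rule $w^4 = w + 1$ for the coordinates; `gf2_rank` and `can_reconstruct` are illustrative names, not part of the paper.

```python
from itertools import combinations


def gf2_rank(rows):
    """Rank over GF(2) of a matrix whose rows are given as bitmask integers."""
    pivots = {}                       # highest set bit -> reduced row
    for row in rows:
        while row:
            top = row.bit_length() - 1
            if top not in pivots:
                pivots[top] = row     # new pivot: the row is independent
                break
            row ^= pivots[top]        # eliminate the leading bit
    return len(pivots)


# Coordinates of 1, w, w^2, w^4, w^5, w^8, w^10 in the basis {1, w, w^2, w^3},
# assuming w^4 = w + 1; bit i of each integer is the coefficient of w^i.
NODES = [0b0001, 0b0010, 0b0100, 0b0011, 0b0110, 0b0101, 0b0111]
K = 3


def can_reconstruct(available_rows, k=K):
    """The object is recoverable iff the available rows have rank >= k."""
    return gf2_rank(available_rows) >= k


triples = list(combinations(NODES, K))
good = sum(can_reconstruct(t) for t in triples)
print(good, "of", len(triples), "triples of nodes allow reconstruction")
```

Under these assumptions the triples that fail are exactly those of the form $\{\alpha, \beta, \alpha + \beta\}$, which are also the triples usable for repair.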

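In the case where the polynomial $p(X) = \sum_{i=0}^{k-1} p_i X^{2^i}$, $p_i \in \mathbb{F}_{q^{M/k}}$, is chosen for encoding, the corresponding system of $n$ linear equations is slightly different:

$$\begin{pmatrix}
\alpha_1 & \alpha_1^{2} & \alpha_1^{2^2} & \cdots & \alpha_1^{2^{k-1}} \\
\alpha_2 & \alpha_2^{2} & \alpha_2^{2^2} & \cdots & \alpha_2^{2^{k-1}} \\
\vdots & & & & \vdots \\
\alpha_n & \alpha_n^{2} & \alpha_n^{2^2} & \cdots & \alpha_n^{2^{k-1}}
\end{pmatrix}
\begin{pmatrix} p_0 \\ p_1 \\ \vdots \\ p_{k-1} \end{pmatrix}
=
\begin{pmatrix} p(\alpha_1) \\ p(\alpha_2) \\ \vdots \\ p(\alpha_n) \end{pmatrix},$$

so that now, though it is still true that if three rows are $\mathbb{F}_q$-linearly dependent, say rows 1, 2 and 3, then

$$u(\alpha_1, \alpha_1^{2}, \alpha_1^{2^2}, \ldots, \alpha_1^{2^{k-1}}) + v(\alpha_2, \alpha_2^{2}, \alpha_2^{2^2}, \ldots, \alpha_2^{2^{k-1}}) = (\alpha_3, \alpha_3^{2}, \alpha_3^{2^2}, \ldots, \alpha_3^{2^{k-1}}), \quad u, v \in \mathbb{F}_q,$$

it is not true anymore that

$$(u\alpha_1 + v\alpha_2,\ u\alpha_1^{2} + v\alpha_2^{2},\ \ldots,\ u\alpha_1^{2^{k-1}} + v\alpha_2^{2^{k-1}}) = (u\alpha_1 + v\alpha_2,\ (u\alpha_1 + v\alpha_2)^{2},\ \ldots,\ (u\alpha_1 + v\alpha_2)^{2^{k-1}})$$

since $u^2 = u$, resp. $v^2 = v$, holds if and only if $u, v \in \mathbb{F}_2$. In this case, we have to analyze the matrix

$$\begin{pmatrix}
\alpha_1 & \alpha_1^{2} & \alpha_1^{2^2} & \cdots & \alpha_1^{2^{k-1}} \\
\vdots & & & & \vdots \\
\alpha_n & \alpha_n^{2} & \alpha_n^{2^2} & \cdots & \alpha_n^{2^{k-1}}
\end{pmatrix}$$

directly.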

Example 7: Consider again Example 4 and suppose the two polynomials $p(X)$ and $p'(X)$ are both evaluated in $\alpha_1 = 1$, $\alpha_2 = \nu$, and $\alpha_3 = w + \nu$. Clearly $w\alpha_1 + \alpha_2 = \alpha_3$. Consequently, when evaluating the polynomial $p(X)$ in $\alpha_1, \alpha_2, \alpha_3$, we get as part of the system of linear equations the following 3 rows:

$$(1, 1, 1, 1), \quad (\nu, \nu^{8}, \nu^{64}, \nu^{512})$$

and

$$(w + \nu, (w + \nu)^{8}, (w + \nu)^{64}, (w + \nu)^{512}) = (w + \nu, w^{8} + \nu^{8}, w^{64} + \nu^{64}, w^{512} + \nu^{512}).$$

Clearly the three rows are linearly dependent: since $w \in \mathbb{F}_8$ satisfies $w^8 = w$, the third row is $w$ times the first row plus the second. If now instead the polynomial $p'(X)$ is similarly evaluated, we obtain

$$(1, 1, 1, 1), \quad (\nu, \nu^{2}, \nu^{4}, \nu^{8})$$

and

$$(w + \nu, (w + \nu)^{2}, (w + \nu)^{4}, (w + \nu)^{8}).$$

This time the dependencies disappear, since

$$w + \nu^2 \neq (w + \nu)^2 = w^2 + \nu^2.$$



B. Probability of object retrieval 

Consider a $(q^d - 1) \times d$ $q$-ary matrix for some $d > 1$, with distinct rows, no all-zero row, and thus rank $d$. The case of interest for us is $d = M/k$, since $\mathbf{M}$ is a $(q^{M/k} - 1) \times M/k$ matrix. If we remove some of the rows uniformly at random, each with probability $1 - p_{node}$, then we are left with an $x \times d$ sub-matrix, where $x$ is binomially distributed. We define $R(x, d, r)$ as the number of $x \times d$ sub-matrices with rank $r$, deliberately counting all the possible permutations of the rows separately.

Lemma 2: Let $R(x, d, r)$ be the number of $x \times d$ sub-matrices with rank $r$ of a tall $(q^d - 1) \times d$ matrix of rank $d$. We have that $R(x, d, r) = 0$ when (i) $r = 0$, (ii) $r > x$, (iii) $r = x$, with $x > d$, or (iv) $r < x$ but $r > d$. Then, counting row permutations:

$$R(x, d, r) = \prod_{i=0}^{r-1} (q^d - q^i) \quad \text{if } r = x,\ x \le d,$$

and for $r < x$ with $r \le d$:

$$R(x, d, r) = R(x-1, d, r-1)\,(q^d - q^{r-1}) + R(x-1, d, r)\,(q^r - x).$$

Proof: There is no non-trivial matrix with rank $r = 0$. When $r > x$, $r = x$ with $x > d$, or $r < x$ but $r > d$, $R(x, d, r) = 0$ since the rank of a matrix cannot be larger than the smallest of its dimensions.

For the case when $r = x$, with $x \le d$, we deduce $R(x, d, r)$ as follows. To build a matrix $\mathbf{M}_x$ of rank $x = r$, the first row can be chosen from any of the $q^d - 1$ rows in $\mathbf{M}$, and the second row should not be a multiple of the first row, which gives $q^d - q$ choices. The third row needs to be linearly independent from the first two rows. Since there are $q^2$ linear combinations of the first two rows, including the all-zero vector, which is discarded, we obtain $q^d - q^2$ choices. In general, the $(i+1)$st row can be chosen from $q^d - q^i$ options that are linearly independent from the $i$ rows that have already been chosen. We thus obtain $R(x, d, r) = \prod_{i=0}^{r-1} (q^d - q^i)$ for $r = x$, $x \le d$.

For the case where $r < x$ with $r \le d$, we observe that $x \times d$ matrices of rank $r$ can be inductively obtained by either (I) adding a linearly independent row to an $(x-1) \times d$ matrix of rank $r - 1$, or (II) adding a linearly dependent row to an $(x-1) \times d$ matrix of rank $r$. We use this observation to derive the recursive relation

$$R(x, d, r) = R(x-1, d, r-1)\,(q^d - q^{r-1}) + R(x-1, d, r)\,(q^r - x),$$

where $q^d - 1 - (q^{r-1} - 1) = q^d - q^{r-1}$ counts the number of linearly independent rows that can be added, and $q^r - 1 - (x - 1) = q^r - x$ is on the contrary the number of linearly dependent rows. ■

We now remove the permutations that we counted in the above analysis by introducing a suitable normalization.

Corollary 2: Let $p(x, d, r)$ be the fraction of sub-matrices of dimension $x \times d$ with rank $r$ out of all possible sub-matrices of the same dimension. Then

$$p(x, d, r) = \frac{R(x, d, r)}{\sum_{j=0}^{d} R(x, d, j)} = \frac{R(x, d, r)}{C^{q^d - 1}_{x}\, x!}.$$

In particular

$$p_x(d) = \sum_{r=k}^{d} p(x, d, r) \qquad (9)$$

is the conditional probability that the stored object can be retrieved by contacting an arbitrary $x$ out of the $n$ storage nodes.

Proof: It is enough to notice that there are $C^{q^d - 1}_{x}$ ways to choose $x$ rows out of the possible $q^d - 1$ options. The chosen $x$ rows can be ordered in $x!$ permutations.

In particular, when the rank is at least $k$, the object can be retrieved. ■

We now put together the above results to compute the probability $p_{obj}$ of an object being recoverable when using an HSRC$(n, k)$ code to store a length $M$ object made of $k$ fragments encoded into $n$ fragments each of length $M/k$.






(a) Validation of the static resilience analysis (b) Comparison of SRC with EC (c) Comparison of SRC with EC 

Fig. 1. Static resilience of homomorphic self-repairing codes (HSRC) for q = 2: Validation of analysis, and comparison with MDS erasure codes (EC) 



Corollary 3: Using an HSRC$(n, k)$, the probability $p_{obj}$ of recovering the object is

$$p_{obj} = \sum_{x=k}^{n} \sum_{r=k}^{d} p(x, d, r)\, C^{n}_{x}\, p_{node}^{x} (1 - p_{node})^{n-x},$$

where $d = \log_q(n + 1)$.

Proof: If $n = n_{\max} = q^{M/k} - 1$, we apply Lemma 2 and Corollary 2 with $d = M/k$. If $n = q^i - 1$, for some integer $i \le M/k$ such that $n \ge k$ (otherwise there is no encoding), then $\mathbf{M}$ has $M/k - i$ columns which are either all-zero or all-one vectors, as shown in Example 6. Thus the number of its sub-matrices of rank $r$ is given by applying Lemma 2 on the matrix obtained by removing these redundant columns. ■
We validate the analysis with simulations and, as can be observed from Figure 1(a), we obtain a precise match.
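As a complement to the simulations, Lemma 2 and Corollaries 2 and 3 can be evaluated numerically. The following is a minimal sketch under the assumption that $n = q^d - 1$; the function names `R`, `p_x` and `p_obj` simply mirror the notation above, and exact fractions are used for $p_x$ to avoid rounding in the recursion.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb, factorial


@lru_cache(maxsize=None)
def R(q, x, d, r):
    """Number of ordered x-row sub-matrices of rank r (Lemma 2)."""
    if r == 0 or r > x or r > d:
        return 0
    if r == x:                                    # all rows independent
        count = 1
        for i in range(r):
            count *= q ** d - q ** i
        return count
    return (R(q, x - 1, d, r - 1) * (q ** d - q ** (r - 1))
            + R(q, x - 1, d, r) * (q ** r - x))


def p_x(q, x, d, k):
    """Probability that x random nodes allow reconstruction (Corollary 2)."""
    total = comb(q ** d - 1, x) * factorial(x)    # ordered choices of x rows
    return Fraction(sum(R(q, x, d, r) for r in range(k, d + 1)), total)


def p_obj(q, d, k, p_node):
    """Static resilience of an HSRC(q^d - 1, k) (Corollary 3)."""
    n = q ** d - 1
    return sum(float(p_x(q, x, d, k))
               * comb(n, x) * p_node ** x * (1 - p_node) ** (n - x)
               for x in range(k, n + 1))


# HSRC(31, 5) over F_2, i.e. q = 2 and d = 5.
for x in (5, 7, 13):
    print(x, float(1 - p_x(2, x, 5, 5)))          # failure probability
print(p_obj(2, 5, 5, 0.9))
```

For HSRC(31,5) the failure probabilities printed for five and seven nodes round to 0.5096 and 0.0757 respectively.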

To conclude this analysis, let us get back to Example 4 and notice that the static resilience analysis derived above holds for the encoding via the polynomial $p(X)$. It would not be the case if $p'(X)$ were used, in which case only the $\mathbb{F}_2$-linear dependencies should be kept.

C. Comparison with standard erasure codes 

While there is marginal deterioration of static resilience using SRC with respect to MDS codes (as compared in Fig. 1 and discussed shortly), we first elaborate how SRC differs fundamentally from MDS codes by looking at the conditional probability that the stored object can be retrieved by contacting an arbitrary $x$ out of the $n$ storage nodes.

For $(n, k)$ MDS erasure codes, $p_x$ is a deterministic, binary value equal to one for $x \ge k$, and zero for smaller $x$. For self-repairing codes, the value is probabilistic. In Fig. 2 we show for our toy example HSRC(31,5) the probability that the object can be retrieved by contacting arbitrary $x$ nodes, i.e., $p_x$, where the values of $p_x$ for $x \ge k$ were computed from (9).$^2$ This can alternatively be interpreted as the probability that the object is retrievable despite precisely $n - x$ random failures, when only $x$ random storage nodes are available.

In particular, if five storage nodes are picked at random, the object cannot be reconstructed with probability 0.5096; if seven random nodes are picked, this probability decreases to 0.0757; and if thirteen or more random nodes are picked, the object can be reconstructed with near certainty (the failure probability is then below $10^{-4}$). In contrast, for MDS codes, the object will be retrievable from the data available at any arbitrary five nodes.

2 $p_x$ is zero for $x < k$ for HSRC also.

Of course, this rather marginal sacrifice (we will next compare HSRC's static resilience with MDS erasure codes to demonstrate the marginality) provides HSRC with a substantial self-repairing capability. Given that a practical system will carry out repairs rather frequently, and HSRC in fact allows very cheap repairs, a system using HSRC will be more easily and cheaply maintained, and hence be reliable, particularly by preventing multiple failures from accumulating.

Likewise, for data access and reconstruction, in practice, storage nodes will be accessed in a planned manner, rather than randomly. There are in fact $C^{n}_{k}\, p_k$ (i.e., 83328 for HSRC(31,5)) unique subsets of precisely $k$ storage nodes that allow reconstruction. Hence, in practice, object access overheads will not be different from those of an MDS-coding-based scheme.

Let us now compare HSRC against standard MDS erasure codes in terms of the effective static resilience. If we use an $(n, k)$ MDS erasure code, then the probability that the object is recoverable when each individual storage node may fail i.i.d. with probability $1 - p_{node}$ is:

$$p_{obj} = \sum_{i=k}^{n} C^{n}_{i}\, p_{node}^{i} (1 - p_{node})^{n-i}.$$

Note that MDS codes may not exist for specific arbitrary 
choice of n and k. However, for the sake of fair comparison, 
this formula and the following plots are provided as if they 
were to exist. 
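The MDS baseline is a plain binomial tail and is straightforward to evaluate. The sketch below uses the hypothetical name `mds_p_obj` and, as in the text, treats the $(n, k)$ MDS code as if it existed for the chosen parameters.

```python
from math import comb


def mds_p_obj(n: int, k: int, p_node: float) -> float:
    """Probability that at least k of n i.i.d. nodes are available."""
    return sum(comb(n, i) * p_node ** i * (1 - p_node) ** (n - i)
               for i in range(k, n + 1))


# Static resilience of a hypothetical (31, 5) MDS code for a few
# node availabilities.
for p_node in (0.2, 0.5, 0.9):
    print(p_node, mds_p_obj(31, 5, p_node))
```

Evaluating this function on the same grid of $p_{node}$ values as the HSRC formula of Corollary 3 gives the kind of comparison plotted in Figure 1.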

In Figures 1(b) and 1(c) we compare the static resilience achieved using the proposed homomorphic SRC with that of MDS erasure codes.

In order to achieve the self-repairing property in SRC, it is necessary to introduce extra 'redundancy' in its code structure, but we notice from the comparisons that this overhead is in fact marginal. For the same storage overhead $n/k$, the overall static resilience of SRC is only slightly lower than that of EC, and furthermore, for a fixed $k$, as the value of $n$ increases, SRC's static resilience gets very close to that of EC. Moreover, even for low storage overheads, with relatively high $p_{node}$, the probability of object availability is practically 1. In any storage system, there will be a maintenance operation to replenish lost fragments (and hence, the system will operate at high values of $p_{node}$). We will further see in the next section that SRCs have significantly lower maintenance overheads. Together, these properties make SRCs a practical coding scheme for networked storage.

Fig. 2. Comparison of the probability of reconstruction of object using encoded data in $x$ random storage nodes

V. Communication overheads of self-repair 

In the previous section we studied the probability of recovering an object if it so happens that only a $p_{node}$ fraction of the nodes which had originally stored the encoded fragments continue to remain available, while lost redundancy is yet to be replenished. Such a situation may arise either because a lazy maintenance mechanism (such as in [9]) is applied, which triggers repairs only when redundancy is reduced to a certain threshold, or else because of multiple correlated failures before repair operations may be carried out. We will next investigate the communication overheads in such scenarios, focusing on those HSRCs with an XOR-like structure (that is, retaining $\mathbb{F}_2$-linearity). Note that this is really the regime in which we need an analysis, since in the absence of correlated failures, and assuming that an eager repair strategy is applied, whenever one encoded block is detected to be unavailable, it is immediately replenished. The proposed HSRC ensures that this one missing fragment can be replenished by obtaining only two other (appropriate) encoded fragments, thanks to the HSRC subspace structure.

Definition 4: The diversity $\delta$ of SRC is defined as the number of mutually exclusive pairs of fragments which can be used to recreate any specific fragment.

In Example 3 it can be seen easily that $\delta = 3$. Let us assume that $p(w)$ is missing. Any of the three exclusive fragment pairs, namely $(p(1), p(w^4))$, $(p(w^2), p(w^5))$ or $(p(w^8), p(w^{10}))$, may be used to reconstruct $p(w)$. See Table III for other examples. In Example 4, where the encoding is done using $p'(X)$, the diversity is $\delta = 31$. Indeed, every encoded fragment is of the form $p'(u + \nu v)$, $u, v \in \mathbb{F}_8$, so that for every $u' + \nu v'$, $u', v' \in \mathbb{F}_8$, we have that the pair $(p'(u' + \nu v'),\ p'((u' + u) + \nu(v' + v)))$ can be used to reconstruct $p'(u + \nu v)$, since $p'(u' + \nu v') + p'((u' + u) + \nu(v' + v)) = p'(u + \nu v)$, and the fragment $p'((u' + u) + \nu(v' + v))$ is indeed present in the network since $u', v' \in \mathbb{F}_8$ and $u, v$ run through every element in $\mathbb{F}_8$ but for the pair $(0, 0)$.

Lemma 3: The diversity $\delta$ of an HSRC$(n, k)$ is $(n - 1)/2$.

Proof: We have that $n = q^d - 1$ for some suitable $d$. The polynomial $p(X)$ is evaluated in $\alpha = \sum_{i=0}^{d-1} a_i w^i$, where $a_i \in \mathbb{F}_q$ and $(a_0, \ldots, a_{d-1})$ takes all the possible $q^d$ values, but for the all-zero one. Thus for every $\alpha$, we can create the pairs $(\alpha + \beta, \beta)$ where $\beta$ takes $q^d - 2$ possible values, that is all values besides $0$ and $\alpha$. This gives $q^d - 2$ (which is equal to $n - 1$) pairs, but since pairs $(\alpha + \beta, \beta)$ and $(\beta, \alpha + \beta)$ are equivalent, we have $(n - 1)/2$ distinct such pairs. ■
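The pairing argument of the proof can be enumerated directly for $q = 2$. In this sketch an evaluation point is represented by its coordinate vector packed into an integer, so that addition is XOR; the name `repair_pairs` is an illustrative assumption.

```python
def repair_pairs(alpha, nodes):
    """All unordered pairs {beta, alpha + beta} of stored points (q = 2).

    `nodes` holds the non-zero evaluation points of the code as bitmasks;
    the pair (beta, alpha ^ beta) repairs the fragment stored for alpha.
    """
    node_set = set(nodes)
    pairs = set()
    for beta in node_set:
        partner = alpha ^ beta
        if beta != alpha and partner in node_set:
            pairs.add((min(beta, partner), max(beta, partner)))
    return sorted(pairs)


# n = 7: span of {1, w, w^2} with bit i holding the coefficient of w^i.
pairs_n7 = repair_pairs(0b010, range(1, 8))      # missing point: w
# n = 15: all non-zero elements of F_16.
pairs_n15 = repair_pairs(0b0010, range(1, 16))
print(len(pairs_n7), len(pairs_n15))
```

With $w^4 = w + 1$ assumed, the three pairs found for $n = 7$ read $(1, w^4)$, $(w^2, w^5)$ and $(w^8, w^{10})$, and in both cases the number of pairs equals $(n - 1)/2$.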

An interesting property of SRC can be inferred from its 
diversity. 

Corollary 4: For a Homomorphic SRC, if at least (n + l)/2 
fragments are available, then for any of the unavailable frag- 
ments, there exists some pair of available fragments which is 
adequate to reconstruct the unavailable fragment. 

Proof: Consider any arbitrary missing fragment $\alpha$. If up to $(n - 1)/2$ fragments were available, in the worst case, these could each belong to a different one of the $(n - 1)/2$ exclusive pairs. However, if an additional fragment is available, it will be paired with one of these other fragments, and hence, there will be at least one available pair with which $\alpha$ can be reconstructed. ■



A. Overheads of recreating one specific missing fragment 

Recall that $x$ is defined as the number of fragments of an object that are available at a given time point. For any specific missing fragment, any one of the corresponding mutually exclusive pairs is adequate to recreate the said fragment. From Corollary 4 we know that if $x \ge (n + 1)/2$ then two downloads are enough. Otherwise, we need a probabilistic analysis. Both nodes of a specific pair are available with probability $(x/n)^2$. The probability that only two fragments are enough to recreate the missing fragment is $p_2 = 1 - (1 - (x/n)^2)^{\delta}$.

If two fragments are not enough to recreate a specific fragment, it may still be possible to reconstruct it with a larger number of fragments. A loose upper bound can be estimated by considering that if 2 fragments are not adequate, $k$ fragments need to be downloaded to reconstruct a fragment,$^3$ which happens with a probability $1 - p_2 = (1 - (x/n)^2)^{\delta}$.

Thus the expected number $D_x$ of fragments that need to be downloaded to recreate one fragment, when $x$ out of the $n$ encoded fragments are available, can be determined as:

$$D_x = 2 \quad \text{if } x \ge (n + 1)/2,$$

$$D_x < 2 p_2 + k (1 - p_2) \quad \text{if } x < (n + 1)/2.$$

3 Note that in fact, often fewer than $k$ fragments will be adequate to reconstruct a specific fragment.
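The bound on $D_x$ is easy to tabulate. The sketch below uses the diversity $\delta = (n - 1)/2$ of Lemma 3; the function name `expected_downloads_bound` is hypothetical.

```python
def expected_downloads_bound(x: int, n: int, k: int) -> float:
    """Upper bound on D_x, the expected downloads to recreate one fragment."""
    delta = (n - 1) // 2                    # diversity of an HSRC(n, k)
    if 2 * x >= n + 1:                      # x >= (n + 1) / 2: two suffice
        return 2.0
    p2 = 1 - (1 - (x / n) ** 2) ** delta    # some repair pair is available
    return 2 * p2 + k * (1 - p2)


# HSRC(31, 5): the bound stays close to 2 until few fragments remain.
for x in (16, 12, 8, 4):
    print(x, expected_downloads_bound(x, 31, 5))
```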
B. Overhead of recreating all missing fragments

Above, we studied the overheads to recreate one fragment. All the missing fragments may be repaired, either in parallel (distributed in different parts of the network) or in sequence. If all missing fragments are repaired in parallel, then the total overhead $D_{prl}$ of downloading necessary fragments is:

$$D_{prl} = (n - x) D_x.$$

If they are recreated sequentially, then the overhead $D_{seq}$ of downloading necessary fragments is:

$$D_{seq} = \sum_{i=x}^{n-1} D_i.$$



In order to directly compare the overheads of repair for different repair strategies (eager, or lazy parallelized and lazy sequential repairs using SRC, as well as lazy repair with traditional erasure codes), consider that lazy repairs are triggered when a threshold $x = x_{th}$ of available encoded fragments out of $n$ is reached. If eager repair were used for SRC encoded objects, a download overhead of

$$D_{egr} = 2(n - x_{th})$$

is incurred. Note that, when SRC is applied, the aggregate bandwidth usage for eager repair as well as both lazy repair strategies is the same, assuming that the threshold for lazy repair satisfies $x_{th} \ge (n + 1)/2$.

In the setting of traditional erasure codes, let us assume that one node downloads enough ($k$) fragments to recreate the original object, and recreates one fragment to be stored locally, and also recreates the remaining $n - x_{th} - 1$ fragments, and stores these at other nodes. This leads to a total network traffic:

$$D_{EClazy} = k + n - x_{th} - 1.$$

Eager strategy using traditional erasure codes will incur k 
downloads for each repair, which is obviously worse than all 
the other scenarios, so we ignore it in our comparison. 

Note that if fewer than half of the fragments are unavailable, as observed in Corollary 4, downloading two blocks is adequate to recreate any specific missing fragment. When too many blocks are already missing, it is logical to apply a repair strategy analogous to that of traditional erasure codes: download $k$ blocks to recreate the whole object, and then recreate all the missing blocks. That is to say, the benefit of reduced maintenance bandwidth usage for SRC (as also of other recent techniques like RGC) only makes sense under a regime when not too many blocks are unavailable. Let us define $x_c$ as the critical value such that, if the threshold for lazy repair in traditional erasure codes $x_{th}$ is less than this critical value, then the aggregate fragment transfer traffic to recreate missing blocks will be less using the traditional technique (of downloading $k$ fragments to recreate the whole object, and then replenishing missing fragments) than by using SRC. Recall that for $x \ge (n + 1)/2$, $D_{egr} = D_{prl} = D_{seq}$. One can determine $x_c$ as follows. We need $D_{prl} \le D_{EClazy}$, that is,

$$2n - 2x_c \le n - x_c + k - 1,$$

implying that

$$x_c \ge n - k + 1.$$
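The cost formulas can be compared numerically around the critical threshold. This is a minimal sketch that assumes the regime $x_{th} \ge (n + 1)/2$, where every SRC repair costs two downloads; the function names are illustrative.

```python
def d_src(n: int, x_th: int) -> int:
    """Downloads to repair all n - x_th fragments with SRC (two each)."""
    return 2 * (n - x_th)


def d_ec_lazy(n: int, k: int, x_th: int) -> int:
    """Traffic for lazy repair with a traditional erasure code."""
    return k + n - x_th - 1


def critical_threshold(n: int, k: int) -> int:
    """Smallest threshold x_c for which SRC is no worse than lazy EC."""
    return n - k + 1


n, k = 517, 100                      # Wuala-like parameters from the text
x_c = critical_threshold(n, k)
for x_th in (x_c - 1, x_c, x_c + 1):
    print(x_th, d_src(n, x_th), d_ec_lazy(n, k, x_th))
```

For these parameters the two strategies break even at $x_c = 418$, that is after $k - 1 = 99$ failures.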



Fig. 3. Average traffic normalized with $B/k$ per lost block for various choices of $x_{th}$.

Figure 3 shows the average amount of network traffic to transfer data from live nodes per lost encoded fragment when the various lazy variants of repair are used, namely parallel ($\gamma_{prl}$) and sequential ($\gamma_{seq}$) repairs with SRC, and (by default, sequential) repair ($\gamma_{EClazy}$) when using EC. RGCs are also shown on this figure (see $\gamma_{MSRGC}$), where $d$ is the number of live nodes contacted during repair.

The x-axis represents the threshold $x_{th}$ for lazy repair, such that repairs are triggered only if the number of available blocks for an object is not more than $x_{th}$. Use of an eager approach with SRC incurs a constant overhead of two fragments per lost block. Note that there are other messaging overheads to disseminate necessary meta-information (e.g., which node stores which fragment), but we ignore these in the figure, considering that the objects being stored are large, and data transfer of object fragments dominates the network traffic. This assumption is reasonable, since for small objects it is well known that the meta-information storage overheads outweigh the benefits of using erasure codes, and hence erasure coding is impractical for small objects.

There are several implications of the above observed behaviors. To start with, we note that an engineering solution like lazy repair, which advocates waiting before repairs are triggered, amortizes the repair cost per lost fragment, is effective in reducing total bandwidth consumption, and outperforms SRC (in terms of total bandwidth consumption), provided the threshold of repair $x_{th}$ is chosen to be lower than $x_c$. This is in itself not surprising. However, for many typical choices of $(n, k)$ in deployed systems such as (16, 10) in Cleversafe [2], or (517, 100) in Wuala [10], a scheme like SRC is practical. In the former scenario, $x_c$ is too low, and waiting so long makes the system too vulnerable to any further failures (i.e., poor system health). In the latter scenario, waiting for a hundred failures before triggering repairs seems unnecessary, and trying to repair 100 lost fragments simultaneously will lead to huge bandwidth spikes.$^4$

Using SRC allows for a flexible choice of either an eager or a lazy (but with much higher threshold $x_{th}$) approach to carry out repairs, where the repair cost per lost block stays constant for a wide range of values (as long as $x_{th} \ge (n + 1)/2$). Such a flexible choice makes it easier to also benefit from the primary advantage of lazy repair in peer-to-peer systems, namely, to avoid unnecessary repairs due to temporary churn, without the drawbacks of (i) having to choose a threshold which leads to system vulnerability, (ii) having to choose a much higher value of $n$ in order to deal with such vulnerability, or (iii) having spiky bandwidth usage.

4 A storage system's vulnerability to further failures, as well as spiky bandwidth usage, are known problems of lazy repair strategies [9].

C. Maximal distance separability & minimum storage point 

To conclude the discussion on the cost of repair, this 
subsection gives some points of comparison between the now 
well known RGC and the newly introduced SRC. The theory 
underlying regenerating codes exposes an interesting trade- 
off between the storage and repair bandwidth overheads for 
maximal distance separable codes, where the data is encoded and stored over $n$ nodes, and encoded data stored at any arbitrary $k$ of these storage nodes allows reconstruction of the whole object.

Suppose that each node has a storage capacity of $\alpha$, i.e., the size of the encoded data block stored at a node is $\alpha$. When one data block needs to be regenerated, a new node contacts $d$ live nodes, and downloads $\beta$ amount of data from each of the contacted nodes (referred to as the bandwidth capacity of the connections between any node pair). By considering an information flow from the source to the data collector, a trade-off between the nodes' storage capacity and bandwidth is computed through a min-cut bound. This analysis determines two interesting constraints. Firstly, regeneration of a lost node is feasible only when at least $k$ live nodes are contacted, i.e. $d \ge k$. Secondly, it determines a trade-off curve between the storage overhead per node $\alpha$ and the bandwidth per regeneration $\gamma = d\beta$. One extreme of this trade-off curve corresponds to the smallest feasible value of $\alpha$, which is $B/k$ (the other one being the smallest feasible value of $\beta$, called the minimal bandwidth repair point (MBR)). We note that this storage overhead corresponds to any optimal encoding scheme which aims to reconstruct the object using no more than $k$ encoded blocks. This point on the trade-off curve is called the minimum storage repair point (MSR), determining the minimal bandwidth requirement for regeneration according to the min-cut max-flow arguments of information flow as follows:

$$(\alpha_{MSR}, \gamma_{MSR}) = \left( \frac{B}{k},\ \frac{B d}{k (d - k + 1)} \right).$$

A similar minimum storage point, computed using the same type of arguments, is available for collaborative RGCs, and takes the form

$$\alpha_{MSR} = \frac{B}{k}, \qquad \beta_{MSR} = \beta'_{MSR} = \frac{B}{k} \cdot \frac{1}{d - k + t},$$

where $\beta'_{MSR}$ denotes the bandwidth used for cooperation and $t$ is the number of new nodes regenerating together and in cooperation with each other (thus, $t$ could be interpreted as the number of failures triggering lazy, collaborative repair).
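To make the two operating points concrete, the sketch below evaluates them with exact fractions and expresses the MSR repair traffic in units of one encoded block $B/k$. The function names are hypothetical, and the collaborative value covers only the per-link amount $\beta = \beta'$, not the total traffic of a collaborative repair.

```python
from fractions import Fraction


def msr_point(B: int, k: int, d: int):
    """(alpha, gamma) at the minimum storage repair point, gamma = d * beta."""
    if d < k:
        raise ValueError("regeneration requires d >= k")
    alpha = Fraction(B, k)
    gamma = Fraction(B * d, k * (d - k + 1))
    return alpha, gamma


def collaborative_msr_point(B: int, k: int, d: int, t: int):
    """(alpha, beta) for collaborative RGCs, with beta = beta'."""
    if d < k:
        raise ValueError("regeneration requires d >= k")
    alpha = Fraction(B, k)
    beta = Fraction(B, k * (d - k + t))
    return alpha, beta


# Wuala-like parameters (n, k) = (517, 100); B is chosen so that B/k = 1.
B, k, n = 100, 100, 517
for d in (k + 1, n - 1):
    alpha, gamma = msr_point(B, k, d)
    print(d, float(gamma / alpha))   # repair traffic in encoded blocks
```

In these units an HSRC repair costs 2 blocks as long as a repair pair is available, whatever the value of $k$.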



Since encoded blocks in HSRC$(n, k)$ codes are also of size $B/k$, a meaningful comparison is possible corresponding to the minimal storage (MSR) point. We note that firstly, HSRC achieves $d \ll k$, which breaches the $d \ge k$ constraint of RGCs. Furthermore, we notice from Figure 3 that HSRC can carry out regenerations with less bandwidth per repair than RGCs, for a certain number of faults and certain RGC parameter choices.

At first sight, this may seem counter-intuitive, given that the max-flow min-cut analysis establishes hard achievability constraints. But recall that these constraints were determined under the assumption of maximal distance separability of the resulting code. That is to say, the significantly superior HSRC-based repair performance is obtained by relaxing the MDS constraint, which, as discussed in Section IV-C, has marginal practical drawbacks or overheads.

We would further like to add that, while the max-flow min-cut bound determines achievability constraints, it does not indicate what coding scheme may achieve them. The currently existing RGC coding schemes in fact often do not support arbitrary values of $d$. Typical codes in the literature are for $d = n - 1$ (for MBR) or $d = k + 1$ (for MSR), and often support only a single repair at a time, which means that, for all practical purposes, the benefit of using HSRC is even stronger. This is because, even if there are regimes where RGCs do better (as may seem from Fig. 3), those cannot in practice be reached with any currently known RGC codes.

The proposed HSRC code also does not have any hidden 
constraint on the underlying field size considered. This is in 
contrast to the implicit assumptions in RGC that both a suitable 
MDS erasure code and network code exist, whose existence 
typically relies on the ability of finding solutions to given 
systems of linear equations. Given a choice of parameters $(n, k)$, an MDS code does not exist for every field size. As for the system of linear equations, solutions tend to exist when the field size becomes big enough. Bounds on the field size for a network code to exist are a topic of ongoing study in the area of network coding.

Finally, it is worth pointing out the significance of the choice of $d$. A typical value of $k$ as used with Wuala is about 100. This means that the number of nodes contacted for one repair is more than a hundred, whereas HSRC in contrast can repair one node by communicating with only two nodes.

VI. Other practical implications: A qualitative discussion

So far we have demonstrated that by embracing the self- 
repairing properties, significant reduction in the aggregate 
bandwidth used for repairs is achieved. This overhead reduc- 
tion is with respect to not only traditional erasure codes, but 
under certain regimes, also in comparison to other 'optimal' 
storage centric codes such as regenerating codes. While repair 
bandwidth overhead reduction was the explicit motivation 
for designing self-repairing codes, the code properties have 
another natural and desirable consequence: in the presence of multiple failures, SRC allows for fast and parallel repairs. We elaborate on this property with an example in the next subsection.



Apart from the lack of the maximum distance separability (MDS) property, another possible critique of HSRC is that it is not a systematic code, that is, pieces of the object are not present uncoded. We have already argued that the original design goal of the self-repairing properties is itself mutually exclusive with the MDS property, but based on quantitative arguments (see Section IV-C), we concluded that this has marginal impact on the resilience or storage overheads of the proposed code. We note that the systematic code property is not necessarily and completely exclusive of the cardinal self-repairing code properties. Indeed, a different construct of self-repairing code [13], based on very different mathematical properties, namely those of projective geometry, has been shown to have systematic-like features. Later in this section, we provide some qualitative arguments on why the lack of the systematic property may not have significant implications in terms of decoding, precisely because of the strong self-repairing properties. We also highlight that some real-life storage system deployments even intentionally avoid using systematic encoded blocks in order to enhance security.

A. Fast & parallel repairs with HSRC 

We observed in the previous section that SRC is effective
in significantly reducing the bandwidth used to maintain
lost redundancy in coding based distributed storage systems.
Depending on system parameter choices, however, an
engineering solution like lazy repair with traditional EC
may or may not outperform SRC in terms of total bandwidth
usage, even though lazy repair with EC entails several
other practical disadvantages.

A final advantage of SRC, which we showcase next, is the
possibility to carry out repairs of different fragments
independently and in parallel (and hence, quickly). If repair
is not fast, further faults may occur during the repair
operations, leading to performance deterioration and,
potentially, loss of stored objects.

Consider the following scenario for ease of exposition:
Assume that each node in the storage network has an
uplink/downlink capacity of 1 (coded) fragment per unit time.
Further assume that the network has relatively (much) larger
aggregate bandwidth. Such assumptions correspond reasonably
with various networked storage system environments.

Consider that for Example 5, n was originally chosen
to be n_max, that is to say, an HSRC(15,3) was used. For
some reason (e.g., lazy repair or correlated failures), let
us say that seven encoded fragments, namely p(1), ..., p(w^6),
are unavailable while fragments p(w^7), ..., p(w^14) are
available. Table IV enumerates possible pairs to reconstruct
each of the missing fragments.

[Table IV: Scenario: seven fragments p(1), ..., p(w^6) are missing]

A potential schedule to download the available blocks
at different nodes to recreate the missing fragments is as
follows. In the first time slot, p(w^11), p(w^10), p(w^12),
nothing, p(w^13), p(w^7) and p(w^8) are downloaded separately
by the seven nodes trying to recreate p(1), ..., p(w^6)
respectively. In the second time slot, p(w^12), p(w^8), p(w^7),
p(w^10), p(w^11), p(w^13) and p(w^14) are downloaded. Note
that, besides p(w^3), all the other missing blocks can now
already be recreated. In the third time slot, p(w^12) can be
downloaded to recreate p(w^3). Thus, in this example, six out
of the seven missing blocks could be recreated within the time
taken to download two fragments, while the last block could be
recreated in the next time round, subject to the constraint
that any node could download or upload only one block in unit
time.
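
The repair pairs and the schedule above can be checked mechanically.
The following is a minimal sketch under stated assumptions, not part
of the original construction: it assumes that w is a root of the
primitive polynomial x^4 + x + 1 over GF(2), so that w^4 = w + 1, and
that by the homomorphic property p(w^i) = p(w^j) + p(w^k) whenever
w^i = w^j + w^k. The helper names (powers_of_w, repair_pairs,
check_schedule) are hypothetical and introduced for illustration only.

```python
from itertools import combinations

# Assumption: GF(16) is built from x^4 + x + 1, i.e. w^4 = w + 1.
# Field elements are represented as 4-bit integers.
def powers_of_w(modulus=0b10011, degree=4):
    """Return [w^0, w^1, ..., w^14] as bit vectors."""
    elems, x = [], 1
    for _ in range(2**degree - 1):
        elems.append(x)
        x <<= 1                  # multiply by w
        if x >> degree:          # reduce modulo the polynomial
            x ^= modulus
    return elems

W = powers_of_w()
LOG = {v: i for i, v in enumerate(W)}   # bit vector -> exponent of w

def repair_pairs(missing, available):
    """Map each missing exponent i to the pairs (j, k) of available
    exponents with w^j + w^k = w^i (addition in GF(16) is XOR)."""
    pairs = {i: [] for i in missing}
    for j, k in combinations(sorted(available), 2):
        i = LOG[W[j] ^ W[k]]
        if i in pairs:
            pairs[i].append((j, k))
    return pairs

def check_schedule(slots, missing, pairs):
    """Return, per missing fragment, the first slot after which it can
    be rebuilt. Each slot lists the exponent downloaded by each
    repairing node (None = idle). Raises if a source uploads twice
    within one slot."""
    have = {i: set() for i in missing}
    done = {}
    for t, slot in enumerate(slots, start=1):
        sent = [e for e in slot if e is not None]
        if len(sent) != len(set(sent)):
            raise ValueError(f"slot {t}: a node uploads more than once")
        for i, e in zip(missing, slot):
            if e is not None:
                have[i].add(e)
            if i not in done and any(set(p) <= have[i] for p in pairs[i]):
                done[i] = t
    return done

missing = list(range(7))          # p(1), ..., p(w^6)
available = list(range(7, 15))    # p(w^7), ..., p(w^14)
pairs = repair_pairs(missing, available)

# The schedule described in the text, written as exponents of w.
slots = [
    [11, 10, 12, None, 13, 7, 8],
    [12, 8, 7, 10, 11, 13, 14],
    [None, None, None, 12, None, None, None],
]
done = check_schedule(slots, missing, pairs)
```

Under these assumptions, every missing fragment has at least two
candidate repair pairs among the available fragments, and the check
confirms that all fragments except p(w^3) are repairable after the
second slot, with p(w^3) following after the third.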

Even if a full copy of the object were to be maintained in
the system (hybrid strategy [16]), with which to replenish the
seven missing blocks, it would have taken seven time units.
If no full copy was maintained, using traditional erasure
codes would have taken at least nine time units.
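
The time counts quoted above can be reproduced with simple accounting
under the one-fragment-per-time-unit constraint. The sketch below is
one reading of how the figures arise, not a statement taken from the
original analysis: it assumes that with a full copy a single node
uploads the m missing fragments one after another, and that with a
traditional erasure code one repairing node first downloads k
fragments, regenerates the missing ones, keeps one and uploads the
other m - 1. The function names are hypothetical.

```python
def time_with_full_copy(m: int) -> int:
    """Time units for one full-copy holder to upload m fragments,
    at one fragment per time unit."""
    return m

def time_with_traditional_ec(m: int, k: int) -> int:
    """Lower bound on time units when one node downloads k fragments,
    decodes, and uploads m - 1 regenerated fragments (it keeps the
    remaining one locally)."""
    return k + (m - 1)

m, k = 7, 3   # seven missing fragments; k = 3 as in HSRC(15,3)
full_copy_units = time_with_full_copy(m)
ec_units = time_with_traditional_ec(m, k)
```

With m = 7 and k = 3 this accounting gives seven and nine time units
respectively, against three for the SRC schedule described in this
section.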

This example demonstrates that SRC allows for fast recon- 
struction of missing blocks. Orchestration of such distributed 
reconstruction to fully utilize this potential in itself poses 
interesting algorithmic and systems research challenges which 
we intend to pursue as part of future work. 

B. On HSRC not being a systematic code 

One can immediately read partial contents of the stored ob- 
ject from systematic encoded blocks. This has both advantages 
- for object retrieval, and disadvantages - in terms of security. 

The fact that HSRC is not a systematic code (every encoded
fragment contains information about every piece of data)
makes object retrieval more costly, since retrieval amounts
to decoding. However, unlike in a classical communication
scenario where decoding has to be done with whatever corrupted
data is available, the situation is different here: thanks to
the repair property, it is possible to have a privileged set of
encoded fragments to be used for decoding, and if some are
missing during object retrieval, they can be repaired first.
The set of encoded fragments to decode from can be chosen to
be closer to systematic blocks than random blocks would be, or
precomputations can be made available to ease the decoding.

Having a systematic code can cause security threats. If the
data is to be stored securely over untrusted storage nodes, as
may be the case in peer-to-peer systems but also in
cloud/data-center environments which may be partially
compromised by either malicious insiders or hackers, then one
would need to apply cryptographic techniques on the original
object, and then store the encrypted object. However, if
non-systematic blocks are used, then individual encoded blocks
do not reveal any information. Instead, one would need access
to enough (k) encoded blocks before being able to read any (and
in fact, the whole) content. Thus, use of non-systematic
encoded blocks provides some level of protection, similar in
spirit to threshold cryptography, in that even if a small
subset of the storage nodes is compromised, it does not reveal
any content. While the level of security is not on par with
encryption/decryption by the data owner using a secret key,
this nevertheless provides an intermediate degree of
protection without the additional cost of
encryption/decryption, instead amortizing on the
encoding/decoding overheads. In fact, Cleversafe employs this
principle for protection according to their corporate website,
providing an example of a distributed storage scenario where
the use of systematic encoded blocks is deemed undesirable and
non-systematic blocks are preferred.

node     p(w^0)   p(w^1)   p(w^2)   p(w^3)   p(w^4)   p(w^5)   p(w^6)
Time 1   p(w^7)   p(w^8)   p(w^9)   p(w^13)  p(w^11)  p(w^12)  p(w^10)
Time 2   p(w^9)   p(w^10)  p(w^11)  p(w^8)   p(w^13)  p(w^14)  p(w^7)

This security argument, coupled with our arguments above
on how reasonably efficient decoding is possible, thus
mitigating the adverse impact of the lack of the systematic
property of HSRC, indicates that HSRC is practical for many
storage centric application scenarios.

VII. Conclusion 

We propose a new family of codes, called self-repairing 
codes, which are designed by taking into account specifically 
the characteristics of distributed networked storage systems. 
Self-repairing codes achieve excellent properties in terms 
of maintenance of lost redundancy in the storage system, 
most importantly: (i) low-bandwidth consumption for repairs 
(with flexible/somewhat independent choice of whether an 
eager or lazy repair strategy is employed), (ii) parallel and 
independent (thus very fast) replenishment of lost redundancy. 
When compared to erasure codes, the self-repairing property
is achieved by marginally compromising on static resilience
for the same storage overhead, or conversely, utilizing
marginally more storage space to achieve equivalent static
resilience. This paper provides the theoretical foundations
for SRCs, and shows their potential benefits for distributed
storage. There are several algorithmic and systems research
challenges in harnessing SRCs in distributed storage systems,
e.g., the design of efficient decoding algorithms, or the
placement of encoded fragments to leverage network topology to
carry out parallel repairs, which are part of our ongoing and
future work.